Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] Combination of autocreate + forced delete of partitioned topic with active consumer leaves topic metadata inconsistent. #18193

Merged
merged 3 commits into from
Oct 28, 2022

Conversation

dlg99
Copy link
Contributor

@dlg99 dlg99 commented Oct 25, 2022

Local fork PR: dlg99#2

Motivation

Forced delete of partitioned topic with active consumer on the namespace where the topic autocreate is enabled leaves the namespace in the state where one cannot create partitioned topic with the same name (because it exists already) and cannot delete it (because it does not exist at the same time)

#17308 -> #17566 -> this rework of 17308

Modifications

Prevents autocreation of the topic partitions during deletion. Correctly detects partitioned topic as non-existing if namespace level metadata is deleted.

Verifying this change

This change added tests

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

nothing AFAICT

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API: (yes / no)
  • The schema: (yes / no / don't know)
  • The default values of configurations: (yes / no)
  • The wire protocol: (yes / no)
  • The rest endpoints: (yes / no)
  • The admin cli options: (yes / no)
  • Anything that affects deployment: (yes / no / don't know)

Documentation

Check the box below or label this PR directly.

Need to update docs?

  • doc-required
    (Your PR needs to update docs and you will update later)

  • doc-not-needed

Bug fix

  • doc
    (Your PR contains doc changes)

  • doc-complete
    (Docs have been already added)

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Oct 25, 2022
@dlg99 dlg99 requested a review from eolivelli October 25, 2022 16:52
@dlg99
Copy link
Contributor Author

dlg99 commented Oct 25, 2022

@poorbarcode @eolivelli Please take a look

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

Nice work!

Please check my comment about the usage of logger

@codecov-commenter
Copy link

codecov-commenter commented Oct 25, 2022

Codecov Report

Merging #18193 (78b899e) into master (6c65ca0) will increase coverage by 10.89%.
The diff coverage is 61.99%.

Impacted file tree graph

@@              Coverage Diff              @@
##             master   #18193       +/-   ##
=============================================
+ Coverage     34.91%   45.81%   +10.89%     
- Complexity     5707    17642    +11935     
=============================================
  Files           607     1579      +972     
  Lines         53396   128785    +75389     
  Branches       5712    14157     +8445     
=============================================
+ Hits          18644    59001    +40357     
- Misses        32119    63693    +31574     
- Partials       2633     6091     +3458     
Flag Coverage Δ
unittests 45.81% <61.99%> (+10.89%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...org/apache/bookkeeper/mledger/LedgerOffloader.java 5.55% <ø> (ø)
...che/bookkeeper/mledger/LedgerOffloaderFactory.java 0.00% <ø> (ø)
...pache/bookkeeper/mledger/LedgerOffloaderStats.java 0.00% <ø> (ø)
...ookkeeper/mledger/LedgerOffloaderStatsDisable.java 0.00% <ø> (ø)
...a/org/apache/bookkeeper/mledger/ManagedCursor.java 85.71% <ø> (ø)
...che/bookkeeper/mledger/ManagedLedgerException.java 67.64% <ø> (ø)
...bookkeeper/mledger/ManagedLedgerFactoryConfig.java 86.66% <ø> (ø)
...g/apache/bookkeeper/mledger/ManagedLedgerInfo.java 77.77% <ø> (ø)
...he/bookkeeper/mledger/OffloadedLedgerMetadata.java 0.00% <ø> (ø)
...ava/org/apache/bookkeeper/mledger/ScanOutcome.java 100.00% <ø> (ø)
... and 1301 more

@codelipenghui codelipenghui added this to the 2.12.0 milestone Oct 26, 2022
@codelipenghui codelipenghui added type/bug The PR fixed a bug or issue reported a bug area/broker labels Oct 26, 2022
@codelipenghui
Copy link
Contributor

@poorbarcode Please help review this PR. Thanks.

Copy link
Contributor

@poorbarcode poorbarcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two-phase delete, that's a good way to do it. That's what I thought at first, but I think we need to improve like this:


1. Resolve the problem that cache updates are not timely.

  • Also two-phase delete like this PR, but there is no need to persist the meta.deleted, just maintains in memory, the detail like below:
broker-1( service unit owner ) broker-2 (owner of topic_xxx-partition-0 )
delete partitioned topic, mark this topic is deleting (do not need to persist the meta)
load topic meta
trigger delete topic-xxx-partition-0
delete topic-xxx-partition-0
..
receive cmd-subscribe from client
create topic-xxx-partion-0 start
send an HTTP request to broker-1 to ask for the delete-state, then create will fail because this partitioned topic is deleting
delete meta of partitioned topic

2. Resolve the problem that the client has invalid topic meta data

(High light)There still have this problem:

broker client
create multi partioned consumer
lookup topic
get the meta
delete partitioned topic
all process of delete topic finished
subscribe
create topic topic_xxx-partition-0, topic_xxx-partition-1, (Hight light)but there has no meta for this partitioned topic, bingo!

We can resolve this problem like this:

The client tells the broker whether this is a partion of a partioned topic when it executes cmd-subscribe and cmd-producer. Then broker receives this request and determines that the topic_xxx-partition-0 to be created does not have any meta, then response "topic does not exists".


3. Another comprehensive solution to this two problem ( Recommended )

We can think of the creation timestamp of a partitioned topic as the version number of this partitioned topic, and update the version first when executing delete( It can be used as the state-delete ). The following process shows how version works:

broker-1( service unit owner ) broker-2 (owner of topic_xxx-partition-0 )` client
create partitioned topic, the version is timestamp 2022-10-27 00:00:00 000
delete partitioned topic start ==>
change version to 2022-10-27 00:00:11 111 and persistent the new version( do not need design a mechanism to reset it even if broker down )
load topic meta, the version is 2022-10-27 00:00:00 000
lookup topic, receive the topic meta, the version is 2022-10-27 00:00:00 000
trigger delete topic_xxx-partition-0, carry the new version in the request
receive cmd delete topic_xxx-partition-0 from broker-1
do delete delete topic_xxx-partition-0
update the version in memory to 2022-10-27 00:00:11 111
do subscribe, carry the version 2022-10-27 00:00:00 000
receive cmd-subscribe
If version is carried, so it is a partition of a partitioned topic, and then verify that the version is out of date
response error
delete finished <==

}
return null;
});
isAllowAutoTopicCreationAsync(topicName, policies).thenAccept(allowed -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we make sure the cache data is not outdated? E.g:

broker-1 broker-2
read meta from cache
delete topic tp_test
write meta
create tp_test-partion-0
cache update
create finish. bingo!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@poorbarcode
createDefaultPartitionedTopicAsync calls createPartitionedTopicAsync which will produce AlreadyExistsException if PartitionedTopic metadata still exists (no matter the state of deleted property) and skip topic deletion/reread metadata.

It is ok if create tp_test-partion-0 already past the metadata check state. broker-1 issues delete tp_test-partion-0 after the PartitionedTopic metadata update (mark deleted) anyways.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cache stateless is a bigger problem and I would keep that out of the scope this patch.
This patch is following the way we do for all multi broker operations (like delete namespace)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@poorbarcode PTAL here #18518


supplier.get().whenComplete((deleteResult, deleteExc) -> {
if (deleteExc != null && mdFound) {
unmarkPartitionedTopicDeletedAsync(topic)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the broker is down at this point, we should design a mechanism to reset the meta of this topic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@poorbarcode can simply delete the topic again.

@dlg99
Copy link
Contributor Author

dlg99 commented Oct 26, 2022

Two-phase delete, that's a good way to do it.

Maybe after the #16590 is done we can explore this option when it becomes available / stable.

Current approach should be good enough (and cherry-pickable into 2.10, if needed).

@eolivelli eolivelli merged commit 3fdbc9f into apache:master Oct 28, 2022
dlg99 added a commit to dlg99/pulsar that referenced this pull request Nov 8, 2022
…with active consumer leaves topic metadata inconsistent. (apache#18193)

(cherry picked from commit 3fdbc9f)
dlg99 added a commit to datastax/pulsar that referenced this pull request Nov 8, 2022
…with active consumer leaves topic metadata inconsistent. (apache#18193)

(cherry picked from commit 3fdbc9f)
@dlg99 dlg99 deleted the delete_fails2 branch November 28, 2022 05:21
@michaeljmarshall
Copy link
Member

michaeljmarshall commented Dec 20, 2022

@dlg99 - should this have been cherry picked to older release branches?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/broker doc-not-needed Your PR changes do not impact docs type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants